Abstract

This paper proposes a new framework for eye center localization that
jointly uses an encoding of normalized image projections and a
Multi-Layer Perceptron (MLP) classifier. The encoding is novel: it
identifies the zero-crossings of each projection and extracts the
relevant parameters from the resulting modes. The compressed normalized
projections produce feature descriptors that are fed to a properly
trained MLP, which discriminates among various categories of image
regions. The proposed framework forms a fast and reliable system for eye
center localization, especially in the context of face expression
analysis in unconstrained environments. We successfully test the
proposed method on a wide variety of databases, including BioID,
Cohn-Kanade, Extended Yale B and Labelled Faces in the Wild (LFW).


1. Introduction 


As noted in a review of the eye localization topic, "eye detection and
tracking remains challenging due to the individuality of eyes,
occlusion, variability in scale, location, and light conditions"
(Hansen & Ji, 2010). Eye data and details of eye movements have numerous
applications in face detection, biometric identification, and
particularly in human-computer interaction tasks. Among the various
applications of eye localization, we are particularly interested in face
expression analysis. Thus, while any method is expected to be accurate
enough on real-life cases and fast enough for real-time applications, we
take an additional interest in the cases where the localization of the
eye centers is challenged by face expression. We will show that the
proposed method, which uses an MLP to discriminate between encoded
normalized image projections from patches centered on the eye and from
patches shifted away from the eye, is both accurate and fast.


1.1. Related work 


The problem of eye localization has been well investigated in the
literature and has a long history (Hansen & Ji, 2010). Methods for eye
center (or iris or pupil) localization in passive, remote imaging may
approach the problem either as a particular case of pattern recognition
(Hamouz et al., 2005), (Asteriadis et al., 2009), or by using the
physical particularities of the eye, like the high contrast to the
neighboring skin (Wu & Zhou, 2003) or the circular shape of the iris
(Valenti & Gevers, 2008). The proposed method combines a pattern
recognition approach with features that make use of the eye's high
contrast.

One of the first eye localization attempts is the work of (Kanade,
1973), who used image projections for this purpose. Since our method
also uses image projections for localization, in the next paragraphs we
present state of the art methods by going from the conceptually closest
to the wider categories. Namely, we start with solutions based on
projections and follow with general eye localization methods and face
fiducial points localization algorithms.

As a general observation, we note that while older solutions (Jesorsky
et al., 2000), (Wu & Zhou, 2003) tried to also estimate the face
position, since the appearance of the Viola-Jones face detection
solution (Viola & Jones, 2004) the eye center search is limited to a
sub-area within the upper face square.


Projections based methods. The same image projections as in the work of
Kanade are used to extract information for eye localization in a
plethora of methods (Feng & Yuen, 1998), (Zhou & Geng, 2004), (Turkan et
al., 2004). (Feng & Yuen, 1998) start with a snake based head
localization followed by anthropometric reduction (relying on the
measurements from (Verjak & Stephancic, 1994)) to the so-called
eye-images, and introduce the variance projections for localization. The
key points of the eye model are particular values of the projections,
while the conditions are manually crafted.

(Zhou & Geng, 2004) describe convex combinations of integral image
projections and variance projections, named generalized projection
functions. These are filtered and analyzed to determine the center of
the eye. The analysis is also manually crafted and requires the
identification of minima and maxima of the computed projection
functions. Yet in specific conditions, such as intense expression or
side illumination, the eye center does not correspond to a minimum or a
maximum of the projection functions. (Liu et al., 2010) use conditions
similar to the ones in (Zhou & Geng, 2004), but applied solely to the
integral projections, to detect whether an eye is open or closed.

(Turkan et al., 2004) introduce the edge projections and use them to
roughly determine the eye position. Given the eye region, a feature is
computed by concatenating the horizontal and vertical edge image
projections. Subsequently, an SVM-based identification of the region
with the highest probability is used for marking the eye. The method
from (Turkan et al., 2004) is, to our best knowledge, the only one
besides ours that couples image projections with machine learning. Yet
we differ by using supplementary data, efficient computation techniques
and elaborate pre- and post-processing steps that keep the accuracy
high and the running time low.


General eye localization methods. There are many other approaches to
the problem of eye localization. (Jesorsky et al., 2000) propose a face
matching method based on the Hausdorff distance followed by an MLP eye
finder. (Wu & Zhou, 2003) even reversed the order of the typical
procedure: they use eye contrast specific measures to validate possible
face candidates. [2004] rely on the Pairwise Reinforcement of Feature
Responses algorithm for feature localization. [2006] use an SVM on
optimally selected Haar wavelet coefficients. [2005] refine the Gabor
filtered faces with an SVM for locating 10 points of interest; yet the
overall approach is different from the face fiducial points approach
that is discussed in the next paragraph. [2006] use an iteratively
bootstrapped boosted cascade of classifiers based on Haar wavelets.
[2007] use multi-scale Gabor jets to construct an Eye Model Bunch.
[2009] use the distance to the closest edge to describe the eye area.
(Valenti & Gevers, 2008, 2012) use isophote properties to gain
invariance and follow with subsequent filtering with Mean Shift (MS) or
nearest neighbor on a SIFT feature representation for higher accuracy.
[2010] relies on thresholding the cumulative histogram for segmenting
the eyes. [2010] train a set of classifiers to detect multiple face
landmarks, explicitly including the pupil center, by using a sliding
window approach that tests all possible locations, and interconnect
them to estimate the overall shape. [2011] base their eye localizer on
gradient techniques and search for circular shapes. [2013] use an
exhaustive set of similarity measures over basic features such as
histograms, projections or contours to extract the eye center location,
having in mind the specific scenario of driver assistance.


Face fiducial points localization. More recently, motivated by the
introduction by [2001] of the active appearance models (AAM), which
simultaneously determine a multitude of face feature points, a new class
of solutions appeared, namely the localization of face fiducial points.
In this category we include the algorithm of [2005], who use a
GentleBoost algorithm for combining features extracted with Gabor
filters; [2008], who extend the original active shape models with more
landmark points and stack two such models; [2012], who model shapes
using a Markov Random Field and classify them using an SVM in the
so-called Borman algorithm; [2011], who use Bayesian inference on SIFT
extracted features; and, most recently, [2012], who use a combination of
regularized boosted classifiers and a mixture of complex Bingham
distributions over texture and shape related features.


1.2. Paper Structure

In this paper we propose a system for eye center localization that
starts with face detection and illumination type detection, followed by
a novel feature extractor, an MLP classifier that discriminates among
possible candidates and a post-processing step that determines the eye
centers. We contribute by:

• Describing a procedure for the fast computation of image projections.
This step is critical for having the solution run in real time.

• Introducing a new encoding technique to the image analysis domain.

• Combining normalized image projections with the zero-crossing based
encoding, which results in image description features named
Zero-crossing based Encoded image Projections (ZEP). They are fast,
simple, robust and easy to compute, and are therefore applicable to a
wider variety of problems.

• Integrating the features in a framework for the problem of eye
localization. We will show that describing the eye area using ZEP leads
to significantly better results than state of the art methods in
real-life cases, represented by the extensive and very difficult
Labeled Faces in the Wild database. Furthermore, to our knowledge, the
complete system is the fastest in the literature among the ones
reporting high performance.

The remainder of this paper is organized as follows: Section 2 reviews
the concepts related to integral projections and describes a fast
computation method for them; Section 3 summarizes the encoding procedure
and the combination with image projections to form the ZEP features.
The paper ends with implementation details, a discussion of the results
achieved in the field of eye localization and proposals for further
developments.


2. Image Projections


2.1. Integral Image Projections

The integral projections, also named integral projection functions
(IPF) or amplitude projections, are tools that have been previously
used in face analysis. They appeared as "amplitude projections" (Becker
et al.) or as "integral projections" (Kanade, 1973) for face
recognition. For a gray-level image sub-window I(i,j) with
i = i_m, ..., i_M and j = j_n, ..., j_N, the projection on the
horizontal axis is the average gray-level along the columns (1), while
the projection on the vertical axis is the average gray-level along the
rows (2):

P_H(j) = \frac{1}{i_M - i_m + 1} \sum_{i=i_m}^{i_M} I(i,j), \quad \forall j = j_n, \ldots, j_N    (1)

P_V(i) = \frac{1}{j_N - j_n + 1} \sum_{j=j_n}^{j_N} I(i,j), \quad \forall i = i_m, \ldots, i_M    (2)

The integral projections reduce the dimensionality of data from 2D to
1D, describing it up to a certain level of detail. Also, the projections
can be computed on any orthogonal pair of axes, not necessarily rows and
columns. This will be further discussed later.


2.2. Edge Projections

Over time, several extensions of the integral projections have been
introduced, such as variance projection functions (Feng & Yuen, 1998) or
edge projection functions (EPF) (Turkan et al., 2004).

Instead of determining edges with the wavelet transform, as in the case
of (Turkan et al., 2004), we use a different approach for computing the
edge projections. First, the classical horizontal and vertical Sobel
contour operators (for details see (Gonzalez & Woods, 2001), sect. 3.7)
are applied, resulting in S_H and S_V, which are combined in the image
S(i,j) used to extract edges:

S(i,j) = S_H^2(i,j) + S_V^2(i,j)    (3)

The edge projections are computed on the corresponding image rectangle:

E_H(j) = \frac{1}{i_M - i_m + 1} \sum_{i=i_m}^{i_M} S(i,j), \quad \forall j = j_n, \ldots, j_N    (4)

E_V(i) = \frac{1}{j_N - j_n + 1} \sum_{j=j_n}^{j_N} S(i,j), \quad \forall i = i_m, \ldots, i_M    (5)

Equations (4) and (5) are simply equations (1) and (2) applied on the
Sobel edge image S(i,j).

As the Sobel operator is invariant to additive changes, the edge
projections are significantly more stable with respect to illumination
changes than the other types of projections.
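As an illustration of the edge projections described above, the following minimal sketch computes the Sobel energy image S = S_H^2 + S_V^2 and averages it along columns and rows of a rectangle. It is not the authors' implementation: the function names, the plain list-of-lists image type, the choice to leave border pixels at zero and the normalization by the number of summed elements are all assumptions made for the example.

```python
from typing import List, Tuple

Image = List[List[float]]


def sobel_energy(img: Image) -> Image:
    """Return S = gx^2 + gy^2 from the two 3x3 Sobel responses.

    Border pixels are left at 0 (an assumption of this sketch).
    """
    rows, cols = len(img), len(img[0])
    s = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Response to intensity changes from left to right.
            gx = (img[i - 1][j + 1] + 2 * img[i][j + 1] + img[i + 1][j + 1]
                  - img[i - 1][j - 1] - 2 * img[i][j - 1] - img[i + 1][j - 1])
            # Response to intensity changes from top to bottom.
            gy = (img[i + 1][j - 1] + 2 * img[i + 1][j] + img[i + 1][j + 1]
                  - img[i - 1][j - 1] - 2 * img[i - 1][j] - img[i - 1][j + 1])
            s[i][j] = gx * gx + gy * gy
    return s


def edge_projections(s: Image, i_m: int, i_M: int,
                     j_n: int, j_N: int) -> Tuple[List[float], List[float]]:
    """Average S along columns (E_H) and along rows (E_V) of a rectangle.

    Index bounds are inclusive, mirroring i = i_m..i_M and j = j_n..j_N.
    """
    n_rows = i_M - i_m + 1
    n_cols = j_N - j_n + 1
    e_h = [sum(s[i][j] for i in range(i_m, i_M + 1)) / n_rows
           for j in range(j_n, j_N + 1)]
    e_v = [sum(s[i][j] for j in range(j_n, j_N + 1)) / n_cols
           for i in range(i_m, i_M + 1)]
    return e_h, e_v
```

Because each Sobel response is a difference of pixel sums with equal weights, adding a constant to every pixel leaves S unchanged, which is the additive invariance mentioned in the text.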


2.3. Fast Computation of Projections


While sums over rectangular image sub-windows may be easily computed
using the concept of summed area tables (Crow, 1984) or integral image
(Viola & Jones, 2004), a fast computation of the integral image
projections may be achieved using prefix sums (Blelloch, 1990) on rows
and, respectively, on columns. A prefix sum is a cumulative array, where
each element is the sum of all elements to the left of it, inclusive, in
the original array. Prefix sums are the 1D equivalent of the integral
image, but they precede it, as the recurrence x_i = a_i + x_{i-1} has
been known for many years.

For the fast computation of image projections, two tables are required:
one holds the prefix sums on rows (a table which, to keep the analogy
with the integral image, will be named the horizontal 1D integral image)
and one, the vertical 1D integral image, contains the prefix sums on
columns. It should be noted that the computation on each row/column is
performed separately. Thus, if the image has M x N pixels, the
horizontal 1D integral image on column j, X_H^j, is:

X_H^j(i) = \sum_{k=1}^{i} I(k,j)    (6)

Thus, the horizontal integral projection corresponding to the rectangle
[i_m; i_M] x [j_n; j_N] is:

P_H(j) = \frac{1}{i_M - i_m + 1} \left( X_H^j(i_M) - X_H^j(i_m - 1) \right)    (7)

The procedure is visually exemplified in figure 1.

[Figure 1 panels: Horizontal integral projection; Vertical integral projection]

Figure 1. Given the two 1D oriented integral images, X_H^j and X_V,
each element of the integral projections P_H and P_V that describe the
marked sub-window is found by a simple subtraction.

Using the oriented integral images, the determination of the integral
projection functions on all sub-windows of size K x L in an image of
M x N pixels requires one pass through the image and 2 x M x N
additions, 2 x (M - K) x (N - L) subtractions and two circular buffers
of (K + 1) x (N + 1) locations, while the classical determination
requires 2 x K x L x (M - K) x (N - L) additions. Hence, the time to
extract the projections associated with a sub-window, where many
sub-windows are considered in an image, is greatly reduced.

The edge projections require the computation of the oriented integral
images over the Sobel edge image, S(i,j). This image needs to be
computed on the areas of interest.

In conclusion, the fast computation of projections opens the direction
of real-time feature localization on high resolution images.


3. Encoding and ZEP Feature

To reduce the complexity (and computation time), the projections are
compressed using a zero-crossing based encoding technique. After
ensuring that the projection values are in a symmetrical range with
respect to zero, we describe, independently, each interval between two
consecutive zero-crossings. Such an interval is called an epoch, and
three parameters are considered for its description (as presented in
figure 2):

• Duration - the number of samples in the epoch;

• Amplitude - the maximal signed deviation of the signal with respect to
0;

• Shape - the number of local extremes in the epoch.

Figure 2. Example of a 1D signal (vertical projection of an eye crop)
and the associated encoding. There are three epochs, each encoded with
three parameters. The associated code is: [4,114,1; 18,-128,3;
14,127,2]

The proposed encoding is similar to the TESPAR (Time-Encoded Signal
Processing and Recognition) technique (King & Phipps, 1999), which is
used in the representation and recognition of 1D, band-limited speech
signals. Depending on the problem specifics, additional parameters of
the epochs may be considered (e.g. the difference between the highest
and the lowest mode of the given epoch). Further extensions are at hand
if an epoch is considered the approximation of a probability density
function and the extracted parameters are the statistical moments of
the said distribution. In such a case the shape parameter corresponds
to the number of modes of the distribution.

The reason for choosing this specific encoding is twofold. First, the
determination of the zero-crossings and the computation of the
parameters is doable in a single pass through the target 1D signal,
and, secondly, the epochs have a specific meaning when describing the
eye region, as discussed in the next subsection.

Given an image sub-window, the ZEP feature is determined by the
concatenation of four encoded projections, as described in the
following:

1. Compute both the integral and the edge projection functions (P_H,
P_V, E_H, E_V);

2. Independently normalize each projection within a symmetrical
interval. For instance, in our application we normalized each of the
projections to the [-128; 127] interval. This normalizes the amplitude
of the projection;

3. Encode each projection as described; allocate for each projection a
maximum number of epochs;

4. Normalize all other (i.e. duration and shape) encoding parameters;

5. Form the final Zero-crossing based Encoded image Projections (ZEP)
feature by concatenation of the encoded projections. Given an image
rectangle, the ZEP feature consists of the epochs from all the 4
projections: (P_H, P_V, E_H, E_V).

Image projections are simplified representations of the original image,
each of them carrying specific information; the encoding simplifies the
image representation even more. The normalization of the image
projections, and thus of the epochs' amplitudes, ensures the
independence of the ZEP feature with respect to uniform variation of the
illumination. The normalization with respect to the number of elements
in the image sub-window leads to partial scale invariance: horizontal
projections are invariant to stretching on the vertical direction and
vice versa. The scale invariance property of the ZEP feature is achieved
by completely normalizing the encoded durations to a specific range
(e.g. the encoded horizontal projection becomes invariant to horizontal
stretching after duration normalization). We stress that, when compared
with previous methods based on projections, which lack the
normalization steps, the proposed algorithm increases the overall
stability to various influences.

3.1. ZEP on Eye Localization 


As noted, image projections have been used in multiple ways for the
problem of eye localization. In an exploratory work, (Kanade, 1973)
determined the potential of image projections for face description.
More recently, (Feng & Yuen, 1998), (Zhou & Geng, 2004) and (Turkan et
al., 2004) presented the use of the integral projections and/or their
extensions for the specific task of eye localization. Especially in
(Zhou & Geng, 2004) it was noted that image projections in the eye
region have a specific sequence of relative minima and maxima
associated with skin (relative minimum), sclera (relative maximum),
iris (relative minimum), etc.


Considering a rectangle from the eye region including the eyebrow (as
shown in figure 3 (a)), the associated integral projections have
specific epochs, as shown in figure 3 (c) and (d). The particular
succession of positive and negative modes is precisely encoded by the
proposed technique. On the horizontal integral projection there will be
a large (one-mode) epoch that is assigned to skin, followed by an epoch
for the sclera, a triple-mode negative epoch corresponding to the eye
center, and another positive epoch for the sclera and skin. On the
vertical integral projection, one expects a positive epoch above the
eyebrow, followed by a negative epoch on the eyebrow, a positive epoch
between the eyebrow and the eye, a negative epoch (with three modes) on
the eye and a positive epoch below the eye.


The ZEP feature, due to the invariance properties already discussed,
achieves consistent performance under various stresses and is able to
discriminate between eyes (patches centered on the pupil) and non-eyes
(patches centered on locations at a distance from the pupil center). As
explained in section 4.2, on the validation set, using a Fisher linear
discriminant, a correct eye detection rate of over 90% is achieved when
separating patches that are centered on the pupil from the ones that are
shifted.


Figure 3. Image projections from a typical eye patch: (a) face crop,
(b) eye crop, (c) integral horizontal image projection on the eye crop,
(d) integral vertical image projection on the eye crop. On the
horizontal projection the double line marks the zero-crossing that is
found on all eye examples, while the rest of the zero-crossings may be
absent in some particular cases.


Figure 4. The work flow of the proposed algorithm. 


4. Implementation 

The block schematic of our eye center localization algorithm is
summarized in figure 4. In the first step, a face detector (the cascade
of Haar features (Viola & Jones, 2004) delivered with OpenCV)
automatically determines the face square. Next, the regions of interest
are set in the upper third of the detected face: from 26% to 50% of the
face square on rows, and from 25% to 37% on columns for the left eye,
respectively from 63% to 75% on columns for the right eye.
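The percentages above translate directly into pixel ranges once the face square is known. The sketch below is only an illustration: the function name, the (top, left, size) description of the face square, the rounding to the nearest pixel and the reading of "left" as the left side of the image are assumptions.

```python
from typing import Dict, Tuple

# (row_start, row_end, col_start, col_end), in image coordinates
Rect = Tuple[int, int, int, int]


def eye_search_regions(face_top: int, face_left: int,
                       face_size: int) -> Dict[str, Rect]:
    """Map the stated percentages of the face square to pixel ranges."""
    def at(fraction: float) -> int:
        # Rounding to the nearest pixel is a choice of this sketch.
        return int(round(fraction * face_size))

    rows = (face_top + at(0.26), face_top + at(0.50))
    left_cols = (face_left + at(0.25), face_left + at(0.37))
    right_cols = (face_left + at(0.63), face_left + at(0.75))
    return {"left": rows + left_cols, "right": rows + right_cols}
```

For a 300 x 300 face square with its corner at the image origin, this gives rows 78 to 150, columns 75 to 111 for the left region and columns 189 to 225 for the right one.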

Noting the susceptibility of the image projections to alter their shape
due to lateral illumination, we introduced a simple method for
detecting such a case, and we adapt the algorithm to the type of
illumination found. After a very simple preprocessing, the ZEP features
for each possible location are computed and fed to a classifier to
identify the possible eye locations. The possible eyes are then
post-processed and the best positions are located, as discussed in
subsection 4.3.


Regarding the face detection, recent solutions use multiple cascades
not only for identifying the face rectangle, but also for determining
the in-plane (roll) and yaw (frontal/profile) angles of the head. Such a
procedure follows the extension by Viola and Jones of the initial face
detector work (Jones & Viola, 2003), (Ramirez & Fuentes, 2008). Thus, it
is customary to limit the analysis of "frontal faces" to a maximum
rotation of 30°.


4.1. Lateral Illumination Detection 

To increase the robustness of the solution to lateral illumination, we
automatically separate such cases. The motivation for the split lies in
the fact that side illumination significantly alters the shape of the
projections in the eye region, thus decreasing the performance of the
classification part.


The lateral illumination detection relies on computing the average
gray levels on the previously selected eye patches. The following
ratios are considered:

L_{ratio} = \frac{L_{top}}{L_{bot}}, \quad R_{ratio} = \frac{R_{top}}{R_{bot}}, \quad H_{ratio} = \frac{L_{top} + L_{bot}}{R_{top} + R_{bot}}    (8)

where L_top and L_bot are the average gray levels on the upper and
lower halves of the left eye, and R_top and R_bot are their
correspondents on the right eye.

The lateral illumination case is considered if any of the computed
ratios, L_ratio, R_ratio, H_ratio, is outside the [0.5; 1.75] range.

Table 1. Percentage of cases detected as frontal illumination on the
Extended Yale B database.

Elevation \ Azimuth    ±[40° : 130°]    ±[0° : 35°]
[...]                  26.11%           88.28%
10° : 90°              17.26%           36.35%

We designed this block such that an illumination that does not produce
significant shadows on the eye region is detected as frontal, and as
lateral otherwise. In terms of illumination angle, the cases with
shadows on the eye region imply an absolute value of the azimuth angle
higher than 40° or an elevation angle higher than 10°. Negative
elevation (light from below) with a low azimuth value does not produce
shadows on the eye region. The interval [0.5; 1.75] mentioned above was
found by matching the mentioned cases with the values of the ratios on
the training database.
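The ratio test described in this subsection can be sketched as below. The function name, the list-of-lists patch format, the split of each patch at its middle row and the small constant that guards against division by zero are assumptions of this example; only the three ratios and the [0.5; 1.75] interval come from the text.

```python
from typing import List, Tuple

Patch = List[List[float]]


def is_lateral_illumination(left_eye: Patch, right_eye: Patch,
                            low: float = 0.5, high: float = 1.75) -> bool:
    """Return True when any of the three gray-level ratios leaves [low, high]."""
    eps = 1e-9  # guard for completely black halves (assumption)

    def half_means(patch: Patch) -> Tuple[float, float]:
        mid = len(patch) // 2
        top = [v for row in patch[:mid] for v in row]
        bot = [v for row in patch[mid:] for v in row]
        return sum(top) / len(top), sum(bot) / len(bot)

    l_top, l_bot = half_means(left_eye)
    r_top, r_bot = half_means(right_eye)
    ratios = (
        l_top / max(l_bot, eps),                    # L_ratio
        r_top / max(r_bot, eps),                    # R_ratio
        (l_top + l_bot) / max(r_top + r_bot, eps),  # H_ratio
    )
    return any(not (low <= r <= high) for r in ratios)
```

The first two ratios react to a vertical brightness gradient inside one eye patch, while the third compares the two eyes and therefore reacts to light coming from one side.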

Indeed, 98.54% of the images from the BioID database are detected as
frontally illuminated, while the results on Extended Yale B are
presented in table 1. The Extended Yale B database has images with
various illumination angles, as will be discussed in section 5.

4.2. Training 

Once the ZEP features are determined, the extracted data is fed into a
Multi-Layer Perceptron (MLP) having one input layer, one hidden layer
and one output layer, trained with the back-propagation algorithm. In
our implementation, the number of neurons in the hidden layer is chosen
to be half the size of the ZEP feature, as this was empirically
determined to be a reasonable trade-off between performance (higher
number of hidden neurons) and speed.

In the preferred implementation, each projection is encoded with 5
epochs, leading to 60 elements in the ZEP feature (4 projections x 5
epochs x 3 parameters) and thus to 60 inputs to the MLP. If more epochs
are produced by a projection (which is very unlikely for eye
localization: less than 0.1% of the tested cases), the last ones are
simply removed.



Figure 6. Localization of the image patch centers used as 
positive examples (white) and negative examples (black). 
The right hand image is zoomed from the left hand one. 


The training of the MLP is performed with crops of eyes and non-eyes of
71 x 71 pixels, as shown in figure 5, while the preferred face size is
300 x 300 pixels. The positive examples are taken near the eye ground
truth: the eye rectangle overlaps more than 75% with the true eye
rectangle. The patches corresponding to the negative examples overlap
with the true eye between 50% and 75%, thus leading to a total of 25
positive examples and 100 negative ones from each eye in a single face
image. Positive and negative locations are shown in figure 6. This
specific choice of positive and negative examples leads to high
localization performance.
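The overlap rule for choosing training patches can be illustrated with the sketch below. It assumes that overlap means the intersection area of two equally sized, axis-aligned 71 x 71 patches divided by the patch area, and that patches below 50% overlap are simply not used. The function names are hypothetical, and the sampling grid that yields 25 positive and 100 negative examples per eye is not specified in the text, so it is not reproduced here.

```python
def overlap_fraction(dx: int, dy: int, size: int = 71) -> float:
    """Overlap between a patch and its copy shifted by (dx, dy) pixels."""
    width = max(0, size - abs(dx))
    height = max(0, size - abs(dy))
    return (width * height) / float(size * size)


def label_shift(dx: int, dy: int, size: int = 71) -> str:
    """Label a shifted patch according to the 75% and 50% overlap limits."""
    overlap = overlap_fraction(dx, dy, size)
    if overlap > 0.75:
        return "positive"
    if overlap >= 0.50:
        return "negative"
    return "unused"  # assumed: patches with less overlap are not sampled
```

For example, a purely horizontal shift of 15 pixels keeps 56/71, about 79%, of the patch and is labelled positive, while a shift of 20 pixels keeps about 72% and is labelled negative.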


In total there were 10,000 positive examples and as many negative ones,
taken from the authors' Eye Chimera database, from the Georgia Tech
database (Nefian & Hayes, 2000) and from the neutral poses selected
from the Yale B database (Georghiades et al., 2001). We have considered
two training variants, corresponding to the two types of illumination
(frontal or lateral).


One training procedure uses images from our data set (40%) and from
Georgia Tech (60%) and focuses on frontal illumination, eye expressions
and occlusions. In this case, the MLP is trained to return the L2
distance from a specific patch center to the true eye center. Thus the
MLP performs regression.


The second training procedure (for lateral illumination) uses only
images from the frontal pose of the Yale B database and is used for
improving the performance against illumination. In this case the
training set was labelled with -1 (non-eye: between 50% and 75% overlap
with the centered eye patch) or 1 (eye: more than 75% overlap).


As many machine learning algorithms are available, we have performed a
short study on examples extracted from the training databases. Given
the number of images in the databases, 20,000 examples were used for
training the networks and approximately 200,000 were used for
classifier validation. For the classification problem, a Support Vector
Machine (SVM) produced a 93.7% correct detection rate, the MLP used
92.6% and an ensemble of 50 bagged decision trees 91.5%. For the
regression case, an SVM for regression led to an approximation error of
0.090, the regression MLP to 0.096 and a bagged ensemble of regression
trees to 0.115. Taking into account the achieved values, there is no
significant performance difference among the various machine learning
systems tested (a conclusion which matches the findings from
(Everingham & Zisserman, 2006)); thus our decision to use the MLP was
based mostly on speed.


Figure 5. Examples of eye (first three images) and non-eye (last three
images) crops; it should be noted that the examples include eye
expressions and glasses.


Table 2. Details of the two merged solutions of the actual eye
localization.

Case                             Frontal illumination      Lateral illumination
Darkness preprocess. threshold   0.15                      0.3
Training database                GeorgiaTech + Authors'    Yale B
Training scheme                  Regression (L2 dist)      True/False
ZEP+MLP threshold                0                         -0.5
Eye area selection               largest lower region      largest region
Eye center localization          weighted center of mass   geometrical center of the rectangle
                                                           circumscribed to the eye region


4.3. Preprocessing and Postprocessing 

The conceptual steps of the actual eye localization procedure are the
same in both illumination cases: preprocessing, machine learning and
postprocessing.

A simple preprocessing is applied to each eye candidate region to
accelerate the localization process. Following (Wu & Zhou, 2003), we
note that the eye center (associated with the pupil) is significantly
darker than its surroundings; thus the pixels that are too bright with
respect to the eye region (and are not plausible eye centers) are
discarded. The "too bright" characteristic is encoded as gray-levels
higher than a percentage (the so-called darkness preprocessing
threshold in table 2) of the maximum value of the eye region. In the
lateral illumination case, this threshold is higher due to the deep
shadows that can be found on the skin area surrounding the eye.
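A minimal sketch of this darkness test follows. The function name and the list-of-lists region format are assumptions; the default threshold of 0.15 is the frontal illumination value listed in table 2, with 0.3 being the lateral one.

```python
from typing import List


def plausible_centers(region: List[List[float]],
                      darkness_threshold: float = 0.15) -> List[List[bool]]:
    """Mark the pixels that are dark enough to be tested as eye centers."""
    # Pixels brighter than this fraction of the region maximum are discarded.
    limit = darkness_threshold * max(max(row) for row in region)
    return [[value <= limit for value in row] for row in region]
```

Only the locations marked True need to be passed on to the feature extraction and classification, which is where the speed-up comes from.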

In the area of interest, using a step of 2 over a sliding image patch of
71 x 71 pixels, we investigate all the plausible locations with the
proposed ZEP+MLP. We consider as positive results the locations where
the value given by the MLP is higher than an experimentally found
threshold (see table 2). These positive results are recorded in a
separate image (the ZEP image, shown in figure 7), which is further
post-processed for eye center extraction.
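The sliding-window scan can reuse the prefix-sum idea of subsection 2.3, so that each element of a window's projections costs one subtraction. The sketch below is an assumption-laden illustration: the function names are hypothetical, the tables carry an extra leading zero row or column instead of the circular buffers mentioned in the text, the projections are normalized by the window size, and the encoding and MLP stages are left out.

```python
from typing import Iterator, List, Tuple

Image = List[List[float]]


def column_prefix_sums(img: Image) -> Image:
    """table[i][j] = img[0][j] + ... + img[i-1][j]; row 0 holds zeros."""
    rows, cols = len(img), len(img[0])
    table = [[0.0] * cols for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            table[i + 1][j] = table[i][j] + img[i][j]
    return table


def row_prefix_sums(img: Image) -> Image:
    """table[i][j] = img[i][0] + ... + img[i][j-1]; column 0 holds zeros."""
    rows, cols = len(img), len(img[0])
    table = [[0.0] * (cols + 1) for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            table[i][j + 1] = table[i][j] + img[i][j]
    return table


def window_projections(col_table: Image, row_table: Image, top: int,
                       left: int, size: int) -> Tuple[List[float], List[float]]:
    """Integral projections of a size x size window, one subtraction each."""
    horizontal = [(col_table[top + size][j] - col_table[top][j]) / size
                  for j in range(left, left + size)]
    vertical = [(row_table[i][left + size] - row_table[i][left]) / size
                for i in range(top, top + size)]
    return horizontal, vertical


def scan_windows(img: Image, size: int = 71, step: int = 2
                 ) -> Iterator[Tuple[Tuple[int, int], Tuple[List[float], List[float]]]]:
    """Yield the projections of every window visited with the given step."""
    col_table = column_prefix_sums(img)
    row_table = row_prefix_sums(img)
    for top in range(0, len(img) - size + 1, step):
        for left in range(0, len(img[0]) - size + 1, step):
            yield (top, left), window_projections(col_table, row_table,
                                                  top, left, size)
```

The two tables are built once per image, so the cost of each window depends only on the window side and not on its area.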

Since closed eyes (which were included in the training set) are similar
to eyebrows, one may get false eye regions given by the eyebrow in the
ZEP image. Thus the ZEP image is segmented and labelled, and the lowest
and largest regions are associated with the eye. This step will discard,
for instance, the regions given by the eyebrow in figure 7 (c).

For the frontal illumination case, due to training with the L2 distance
as objective, one expects a symmetrical shape around the true eye
center. Thus the final eye location is taken as the weighted center of
mass of the previously selected eye regions. For the lateral
illumination case, the binary trained MLP is supposed to localize the
area surrounding the eye center, and the final eye center is the
geometrical center of the rectangle circumscribed to the selected
region. We note that in both cases the specific way of selecting the
final eye center is able to deal with holes (caused by reflections or
glasses) in the eye region.
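The two ways of extracting the final eye center can be sketched as follows, assuming the selected region is available as a list of (row, column, weight) triples, where the weight is the value stored in the ZEP image. The function names and this data layout are assumptions, and the segmentation and labelling step that selects the region is not shown.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, float]  # (row, column, ZEP image value)


def weighted_center_of_mass(region: List[Pixel]) -> Tuple[float, float]:
    """Frontal illumination case: average position weighted by the ZEP value."""
    total = sum(w for _, _, w in region)
    row = sum(r * w for r, _, w in region) / total
    col = sum(c * w for _, c, w in region) / total
    return row, col


def bounding_rectangle_center(region: List[Pixel]) -> Tuple[float, float]:
    """Lateral illumination case: center of the circumscribed rectangle."""
    rows = [r for r, _, _ in region]
    cols = [c for _, c, _ in region]
    return (min(rows) + max(rows)) / 2.0, (min(cols) + max(cols)) / 2.0
```

A region with a symmetric hole in the middle, for instance one caused by a reflection, gives the same center with both functions, which illustrates why missing pixels inside the region need not shift the estimate.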

An overview of how each step is implemented in the two illumination
cases considered is shown in table 2.


Figure 7. The ZEP images, (a) and (c), resulting after processing with
ZEP+MLP the original eye patch images, (b) and (d), under the frontal
illumination case. The first two images, (a) and (b), show a simple
case, where only patches around the eye center were selected by the
MLP. The right hand two images, (c) and (d), contain a more difficult
case, where the eyebrow was also identified as a possible eye region.
Higher gray-levels in the ZEP image signal higher confidence, as
estimated by the MLP, in having the eye center at that specific
location.


5. Results and Discussions 


We will first discuss the influence of various system
parameters on the overall results. For this purpose we
will use the BioID database². This database contains
1521 gray-scale, frontal facial images of size 384 x 286,
acquired with frontal illumination conditions in a complex
background. The database contains 16 tilted and
rotated faces, people that wear eye-glasses and, in very
few cases, people that have their eyes shut or pose various
expressions. The database was released with annotations
for iris centers. Being one of the first databases
that provided facial annotations, BioID became the
most used database for face landmark localization
accuracy tests, even though it provides limited variability
and reduced resemblance with real-life cases. We will
use BioID as a starting point in discussing the achieved
results (to give an insight into the system's various
parameters and to select the best performing state of
the art systems), and later continue the evaluation
under other stresses like eye expression, illumination
angle or pose. Yet the most relevant test is
on real-life cases, such as those gathered in the Labelled
Faces in the Wild database that will be presented later on.


The localization performance is evaluated according
to the stringent localization criterion (Jesorsky et al.,
2000). The eyes are considered to be correctly determined
if the specific localization error e, defined in
equation (9), is smaller than a predefined value.

$$ e = \frac{\max\{\varepsilon_L, \varepsilon_R\}}{D_{eye}} \qquad (9) $$

2 http://www.bioid.com/downloads/software/bioid-face-database.html


Table 3. Percentage of correct eye localization of the
proposed algorithm (integral - IPF and edge projection
functions - EPF) compared with IPF-only and EPF-only
implementations

Projection Type       IPF+EPF    IPF      EPF
Accuracy, e < 0.05    70.46      56.61    53.17
Accuracy, e < 0.10    91.94      87.58    84.02
Accuracy, e < 0.25    98.87      98.81    96.50


In the equation above, $\varepsilon_L$ is the Euclidean distance
between the ground truth left eye center and the determined
left eye center, $\varepsilon_R$ is the corresponding value
for the right eye, while $D_{eye}$ is the distance between
the ground truth eye centers. Typical error thresholds
are e = 0.05, corresponding to eye centers found
inside the true pupils, e = 0.1, corresponding to eye
centers found inside the true irises, and e = 0.25,
corresponding to eye centers found inside the true sclera.
Since it takes the larger of the two errors, this criterion
reflects the worst case scenario.
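
As a hedged illustration of the criterion in equation (9), the sketch below computes the normalized error from ground truth and detected eye centers. The function name, the argument order and the sample coordinates are assumptions chosen for the example and do not come from the paper.

```python
import math


def localization_error(gt_left, gt_right, det_left, det_right):
    """Normalized worst-eye error: max of the two eye errors over the
    ground truth inter-ocular distance (hypothetical helper)."""
    eps_left = math.dist(gt_left, det_left)      # error on the left eye
    eps_right = math.dist(gt_right, det_right)   # error on the right eye
    d_eye = math.dist(gt_left, gt_right)         # ground truth eye distance
    return max(eps_left, eps_right) / d_eye


# Illustrative coordinates (pixels): left eye missed by 5 px, right eye exact,
# eyes 60 px apart.
e = localization_error((100, 100), (160, 100), (103, 104), (160, 100))
print(round(e, 4))
print("within iris (e < 0.1):", e < 0.1)
print("within pupil (e < 0.05):", e < 0.05)
```

In this toy case the detection passes the e < 0.1 test but fails the e < 0.05 test, even though one eye is localized perfectly, which is exactly the worst case behaviour of the criterion.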

We note that, while the BioID image size leads to
approximately a 40 x 40 size for the eye patch, because
our targets are HD video frames (for which we
will also provide duration), we upscale the face square
to 300 x 300, thus having an eye square of 71 x 71.

The results on the BioID database are shown in
figure [8] (a), where we represented the maximum (better
localized eye), average and minimum (worst localized
eye) accuracy with respect to various values of the e
threshold.


5.1. The Influence of ZEP Parameters 


We investigated the performance of the proposed system
when only one type of projection is used. The
results are presented in table [3]. The computation
time dropped to ≈ 53% of the full algorithm time if
only one projection type is used. At e < 0.05, the accuracy
drops by ≈ 14 percentage points in the case of integral
projections and by ≈ 17 percentage points in the case of
edge projections. Using the proposed encoding it is possible
to keep both the dimensionality of the feature and the time
duration low enough in order to use more than one type of
projection. This supplementary information helps to
increase the accuracy of the results when compared with the
method in (Turkan et al., 2004).


Alternatives to the eye crop size and resulting values
are presented in table [4]. The experiment was performed
by re-training the MLP with eye crops of the
target size. As one can notice, the results are similar,
thus supporting the scale invariance of the ZEP feature.
Slight variation is due to the pre- and post-processing.


Figure 8. Maximum (best case - red dotted line), average (blue dashed line) and minimum (worst eye - black solid line)
accuracy for eye localization on the BioID database (left - (a)) and Cohn-Kanade database (right - (b)).


Table 4. The variation of the results with respect to the
size of the eye analysis window.

Crop size             36 x 36    71 x 71    100 x 100
Accuracy, e < 0.05    64.00      70.46      64.20
Accuracy, e < 0.10    90.36      91.94      90.22
Accuracy, e < 0.25    98.61      98.87      99.14


5.2. The Dimensionality Reduction

The main visible effect of the proposed encoding is the
reduction of the size of the concatenated projections.
Yet, as we have adapted the encoding technique to the
specifics of the projection functions applied on the eye
area, its performance is higher than that of other methods.
To see the influence of this encoding technique, we
compared the achieved results with the ones obtained
by reducing the dimensionality with PCA (the best
known such technique) by the same amount
as the proposed one. The rest of the algorithm remains
the same. The comparative results may be seen in
table [5]. We also report the results when no reduction
was performed.

The results indicate that both methods are lossy compression
techniques and lead to decreased accuracy.
The proposed method is able to extract the specifics
of the eye from the image projections, as discussed in
subsection 3.1, being marginally better than the PCA
compression at e < 0.05 and comparable at the other
thresholds.

Furthermore, we take into account that the dimensionality
reduction with PCA requires, for each considered
location, a matrix multiplication to project
the initial vector of size N_p (N_p = 284) onto the final
space (with size M_p = 60), thus having the complexity
O(N_p M_p) = O(284 x 60). In comparison, the
determination of the epochs parameters is done in a
single pass over the initial vector (i.e. with complexity
O(N_p) = O(284)), thus we expect the proposed
method to be significantly faster.

Indeed, the average value for computation time increases
from 6 msec (using the proposed method) to 11
msec (almost double) using PCA on a 300 x 300 face
square. The lack of compression increases the duration
to 24 msec per face square.


Table 5. Percentage of correct eye localization of the
proposed algorithm (ZEP encoding) compared with
dimensionality reduction with PCA.

Encoding Type         Proposed   PCA      None
Accuracy, e < 0.05    70.46      69.66    72.97
Accuracy, e < 0.10    91.94      92.70    93.52
Accuracy, e < 0.25    98.89      98.87    99.07


5.3. Robustness to Noise

An image projection represents a gray-scale average,
hence it is reasonable to expect that the proposed
method is very robust to noise. To study robustness
to noise we have artificially added Gaussian noise to
the BioID images and we subsequently measured the
localization performance for e < 0.1 accuracy. Indeed,
while the noise variance increases from 0 to 30,
the average accuracy decreases from 91.94% to only
79.92%. The variation of the accuracy with respect
to the added noise standard deviation may be seen in
figure [9]. Examples of images degraded by noise may
be seen in figure [10].


Figure 9. Variation of the localization performance for e <
0.1 accuracy with respect to the Gaussian noise variance used
for image degradation.


Figure 10. Image from the BioID database degraded by
noise. For a human observer it is quite difficult to determine
the eye centers. Yet the results shown in figure [9]
prove that our method performs remarkably well.
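
The argument of subsection 5.3, that a projection is an average and therefore attenuates noise, can be illustrated with a small numerical experiment. The sketch below is not the authors' experiment: it assumes a flat synthetic patch, takes the projection to be the plain row average, and uses hypothetical helper names; only the 71 pixel width mirrors the eye square size mentioned in the text.

```python
import random
import statistics


def row_projection(image):
    """Average gray level of each row (a simple integral projection)."""
    return [sum(row) / len(row) for row in image]


def noisy_flat_patch(rows, cols, gray, sigma, seed=0):
    """Constant-gray patch degraded by additive Gaussian noise (synthetic data)."""
    rng = random.Random(seed)
    return [[gray + rng.gauss(0.0, sigma) for _ in range(cols)]
            for _ in range(rows)]


sigma = 5.0
patch = noisy_flat_patch(rows=500, cols=71, gray=128.0, sigma=sigma)

# Noise level measured on individual pixels versus on the projection values.
pixel_std = statistics.pstdev([v for row in patch for v in row])
projection_std = statistics.pstdev(row_projection(patch))

print("pixel noise std:     ", round(pixel_std, 3))
print("projection noise std:", round(projection_std, 3))
```

For independent noise, averaging N pixels is expected to divide the noise standard deviation by about the square root of N, so with 71 pixels per row the projection noise should be roughly eight times smaller than the pixel noise.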


5.4. Results on the BioID 

To give an initial overview of the state of the art,
we consider the results reported by other
methods on the BioID database. Other solutions for
eye localization are summarized in table [6]. The results
of the methods for localization of face fiducial points
are shown in table [7]. Visual examples of images with
localized eyes produced by our method are shown in
figure [11].

Analyzing the performance, first we note that our
method significantly outperforms in both time and
accuracy other methods relying on image projections
((Zhou & Geng, 2004), (Turkan et al., 2004)). The
explanation lies in the normalization procedure implied
when constructing the ZEP feature.

Comparison with face feature fiducial points localization
is not straightforward. While such methods localize
significantly more points than simple eye centers
localization, they also rely strongly on the inter-spatial
relation among them to boost the overall performance.
Furthermore, they often do not localize eye centers,
but eye corners and the top/bottom of the eye, which
in many cases are more stable than the eye center (i.e.
not occluded or influenced by gaze). And yet we note
that our method is comparable in terms of accuracy
and significantly faster (if one normalizes the reported
time by the number of detected points).


Figure 11. Face cropped images from BioID database. Top
row shows images with eyes correctly localized, while
bottom row shows failure cases.


Regarding other methods for eye localization, the
proposed method ranks as one of the top methods for
all accuracy tests, being always close to the best
solution. Furthermore, taking into account that in the
BioID database there are only ≈ 50 images (3%) with
closed eyes, methods that search for circular (symmetrical)
shapes have more favorable circumstances. Because we
targeted images with expressions, we specifically included
closed eyes in our training data set. To validate this
assumption we tested, with very good results, on the
Cohn-Kanade database, showing that our method is
more robust in that case, as shown in figure [8] (b).


Considering as most important criterion the accuracy
at e < 0.05, we note that (Timm & Barth, 2011) and
(Valenti & Gevers, 2012) provide higher accuracy. Yet,
we must also note that the highest performance, achieved
by a variation of the method described in
(Valenti & Gevers, 2012), namely Val.&Gev.+SIFT, relies
on a 10-fold testing scheme, thus using 9 parts of the
BioID database for training. Furthermore, taking into
account that the BioID database has been used for more
than 10 years and provides limited variation, it has been
concluded (Belhumeur et al., 2011), (Dantone et al., 2012)
that other tests are also required to validate a method.


(Valenti & Gevers, 2012) provide results on other datasets
and made public the associated code for their baseline system
(Val.&Gev.+MS), which is not database dependent.
(Timm & Barth, 2011) do not provide results on any other
database except BioID, nor source code, yet there is publicly
available code developed with author involvement. Thus,
in continuation, we will compare the hereby proposed

2 At http://thume.ca/projects/2012/11/04/simple-accurate-eye-center-tracking-in-opencv/


Table 6. Comparison with state of the art (listed in chronological appearance) in terms of localization accuracy on the
BioID database. The correct localization presents results reported by authors; values marked with * were inferred
from authors' plot. While 2004 reports only the value for e < 0.25, the rest is reported by (Ciesla & Koziol, 2012). The
method marked with † relied on a 10-fold training/testing scheme, thus, at a step, using 9 parts of the BioID database
for training.

                                            Accuracy
Method                              e < 0.05   e < 0.1   e < 0.25
Proposed                            70.46      91.94     98.89
(Jesorsky et al., 2000)             40.0       79.00     91.80
(Wu & Zhou, 2003)                   10.0*      54.00*    93.00*
(Zhou & Geng, 2004)                 47.7       74.5      97.9
(Cristinacce et al., 2004)          55.00*     96.00     98.00
(Campadelli et al., 2006)           62.00      85.20     96.10
(Hamouz et al., 2005)               59.00      77.00     93.00
(Turkan et al., 2004)               19.0*      73.68     99.46
(Kroon et al., 2008)                65.0       87.0      98.8
(Asteriadis et al., 2009)           62.0*      89.42     96.0
(Asadifard & Shanbezadeh, 2010)     47.0       86.0      96.0
(Timm & Barth, 2011)                82.50      93.40     98.00
(Valenti & Gevers, 2012)+MS         81.89      87.05     98.00
(Valenti & Gevers, 2012)+SIFT †     86.09      91.67     97.87
(Florea et al., 2012)               57.13      88.97     98.48


Table 7. Comparison with methods that localize a multitude (≈ 20) of points on the face on the BioID database. All the
results were extrapolated from authors' graphs. We note that in these cases eye localization was not the major objective
and authors report an average error for the entire set of points; thus we also used the average achieved error. The reported
time is for determination of the entire set of points.

                                          Accuracy              Time performance
Method                         No. pts.   e < 0.05   e < 0.1    Duration     Platform      Duration/point
Proposed                       2          78.80      96.04      13 msec      i7 2.7 GHz    6.5 msec
(Vukadinovic & Pantic, 2005)   20         15.0       78.0       n/a          n/a           n/a
(Milborrow & Nicolls, 2008)    17         66.0       95.0       230 msec     P4 3GHz       13.53 msec
(Valstar et al., 2010)         22         74.00      95.00      n/a          n/a           n/a
(Belhumeur et al., 2011)       29         62.00      85.20      ≈ 950 msec   i7 3.06GHz    32.75 msec
(Mostafa & Farag, 2012)        17         74.00      96.00      470 msec     i7 2.93GHz    27.64 msec


Table 8. Percentage of correct eye localization on the
Cohn-Kanade database. We report results on the neutral
poses, expression apex and overall, and compare against
the methods from (Valenti & Gevers, 2012), (Timm & Barth,
2011) and (Ding & Martinez, 2010). We marked with light
gray background the best achieved performance for each
accuracy criterion and respectively for each image type.


Method 


Proposed 


(Valenti & Gevers, 2012) 


(Timm & Barth 2011) 


(Ding & Martinez 2010) 


Type 


Neutral 


Apex 


Total 


Neutral 


Apex 


Total 


Neutral 


Apex 


Total 


Neutral 


Apex 


Total 


e < 0.05 



Accuracy 


46.0 


35.1 


40.6 


66.0 


61.4 


63.7 


14.3 


11.8 


13.1 


954 


90.2

75.9

96.2

100


Figure 12. Face cropped images from the Cohn-Kanade
database. The ground truth eyes are marked with red
(dark grey), while detected eyes with green (light grey).
Top row images show eyes correctly localized, while
bottom row shows failure cases.


method against these two on other datasets. Additionally,
we include the comparison against the eye detector
developed by (Ding & Martinez, 2010), which has also been
trained and tested on other databases, thus is not BioID
dependent.


5.5. Robustness to Eye Expressions

As mentioned in the introduction, we are specifically
interested in the performance of the eye localization
with respect to facial expressions, as real-life cases
with fully opened eyes looking straight are rare. We
tested the performance of the proposed method on the
Cohn-Kanade database (Kanade et al., 2000). This
database was developed for the study of emotions,
contains frontal illuminated portraits and it is challenging
through the fact that eyes are in various poses
(near-closed, half-open, wide-open). We tested only on the
neutral pose and on the expression apex image from
each example. The correct eye locations, with standard
precisions, are shown in table [8]. Typical localization
results are presented in figure [12], while the
maximum, average and minimum errors are plotted in
figure [8] (b).

We note that solutions that try to fit a circular or a
symmetrical shape over the iris, like (Valenti & Gevers,
2012) or (Timm & Barth, 2011), and thus perform
well on open eyes, do encounter significant problems
when facing eyes in expressions (as shown in table
[8]). Taking into account the achieved results, which
are comparable on neutral pose and expression apex,
we show that our method performs very well
under complex conditions. Achieved results indicate
approximately a doubled accuracy when compared
with the foremost state of the art method.


5.6. Robustness to Illumination and Pose 


We systematically evaluated the robustness of the
proposed algorithm with respect to lighting and pose
changes. This was tested on the Extended Yale Face
Database B (B+) (Lee et al., 2005). We stress that
only part of the Yale B database (Georghiades et al., 2001)
was used for training the MLP for lateral illumination,
so that the training and testing sets are completely
different.

The Extended Yale B database contains 16128 gray-scale
images of 28 subjects, each seen under 576 viewing
conditions (9 poses x 64 illuminations). The size
of each image is 640 x 480. The robustness with respect
to pose and with respect to illumination was evaluated
separately.

For evaluating the robustness to illumination, we
tested the system on 28 faces, in neutral pose, under
changing illumination (64 cases). The results are
summarized in table [9].

The system achieves reasonable results in the cases
when even a human observer is not able to identify the
eyes. As long as the illumination is constant over the
eye, the system performs very well, supporting the claim
that the ZEP feature is invariant to uniform illumination.
Examples of localization while illumination varies are
presented in figure [13].

For larger illumination angles, due to the uneven
distribution of the shadows, the shape of the projections
is significantly altered and the accuracy decreases.
Examples of cases where the shades are too strong or
inopportunely placed, and where we reach lower results,
are shown in figure [14].


Figure 13. Face cropped images from the Extended YaleB
database showing robustness to illumination.


To evaluate the robustness of the algorithm with respect
to the face pose, we consider each of the 28
persons with frontal illumination, but under varying
poses (9 poses for each person). Pose angles are in the
set {0°, 12°, 24°}, thus spanning the typical range for
“frontal face”. The results are shown in table [10] and
visual examples in figure [15]. Taking into account
that the maximum number of images that have the
worst eye less accurate than 0.1 is 2, we may truthfully
say that the proposed method is robust to face
pose.


When compared with the method proposed in (Valenti
& Gevers, 2012), our solution performs marginally better.
If we consider the results reported in the mentioned
paper, then the average result for accuracy at
e < 0.1 is 88.07%, computed on the smaller YaleB
database (Georghiades et al., 2001), while our method
reaches 89.85% on the same subset of azimuth and
elevation illumination angles on the larger Extended
YaleB database (Lee et al., 2005). If we compare the
full results on the entire Extended YaleB database
(including extreme illumination cases), then our method
outperforms it by a small margin at the strictest
threshold (e < 0.05), as shown in table [11]. Our method
performs significantly better than the ones proposed by
(Timm & Barth, 2011) and respectively (Ding & Martinez,
2010).


Figure 14. Face cropped images from Extended Yale B
database showing illumination cases that reveal limitations
of the method. Specific shapes of projections caused by
light and shade make the system prone to errors. The
illumination angle is given by azimuth A and elevation E;
the panels correspond to (A = -50°, E = 0°),
(A = +50°, E = -40°) and (A = -50°, E = 0°).


Table 10. Pose variation studied on the Extended Yale B
(B+) data set. The specified numerical values have been
obtained for accuracy e < 0.1. For each category at most
two images (corresponding to an accuracy of 92.86%) were
missed.

                      Azimuth
Elevation     24°       12°       Frontal
Up            96.43     92.86     96.43
Neutral       96.43     96.43     92.86
Down          92.86     92.86     96.43


Table 9. Illumination variation studied on the Extended Yale B (B+) data set. The numerical values have been obtained
for accuracy e < 0.1. If more cases were available in the given interval, the average is reported. If the specific case does
not exist in the database, "n/a" is reported.

                                           Azimuth
Elevation      ±[110° : 130°]   ±[70° : 90°]   ±[50° : 60°]   ±[20° : 35°]   ±[5° : 15°]   0°
-40° : -35°    n/a              73.21          58.93          60.71          n/a           92.86
-20° : -10°    71.43            76.79          83.93          84.82          91.96         96.43
0°             42.86            75.00          65.50          94.64          91.07         92.86
10° : 20°      67.86            76.79          75.87          87.50          95.54         100
40° : 45°      75.00            76.79          n/a            75.00          n/a           89.29
65° : 90°      82.14            n/a            n/a            64.29          n/a           78.57


Figure 15. Face cropped images from Extended YaleB
database showing robustness to pose.


Table 11. Comparative results on the Extended YaleB
database

Method                      e < 0.05   e < 0.1   e < 0.25
Proposed                    39.9       61.3      91.3
(Valenti & Gevers, 2012)    37.8       66.6      98.5
(Timm & Barth, 2011)        20.1       34.5      51.5
(Ding & Martinez, 2010)     19.7       47.8      58.6


5.7. Accuracy in Real-Life Scenarios

While BioID, Cohn-Kanade and Extended YaleB
databases include specific variations, as they are
acquired under controlled lighting conditions with only
frontal faces, they cannot be considered to closely
resemble real-life applications. In contrast, there are
databases like the Labeled Face Parts in the Wild
(LFPW) (Belhumeur et al., 2011) and the Labeled
Faces in the Wild (LFW) (Huang et al., 2007), which
are randomly gathered from the Internet and contain large
variations in the imaging conditions. While LFPW is
annotated with facial point locations, only a subset
of about 1500 images is made available, and it contains
high resolution and rather good quality images. In
opposition, the LFW database contains more than 12000
facial images, having the resolution 250 x 250 pixels,
with 5700 individuals that have been collected “in the
wild” and vary in pose, lighting conditions, resolution,
quality, expression, gender, race, occlusion and make-up.

The difficulty of the images is certified by the
human evaluation error as reported in (Dantone et al.,
2012), which also provided annotations. While the
ground truth is taken as the average of human markings
for each point normalized to inter-ocular distance,
human evaluation error is considered as the averaged
displacement of the one marker.

Examples of the results achieved on the LFW database
may be seen in figure [16]. Numerical results, compared
with the solutions from (Valenti & Gevers, 2012),
(Timm & Barth, 2011), (Ding & Martinez, 2010) and
with human evaluation error, are presented in figure
[17].

Regarding the achieved results, we note that even though
our method was designed to work on large resolution
faces, it provides accurate results when applied
on smaller ones. As one can see in figure [17], we
significantly outperform the state of the art solutions
from (Valenti & Gevers, 2012) and from (Timm & Barth,
2011), by almost 50% improvement at e < 0.05 accuracy,
on a database of over 12000 images that presents
cases as close to real-life as possible, and, by a larger
margin, the method from (Ding & Martinez, 2010).

5.8. Algorithm Complexity 

The entire algorithm requires only four divisions for
the projections normalization and two for determination
of the region weighting center with variable
denominator per eye crop, and no high precision
operations, therefore needing only limited fixed point
precision. The ZEP+MLP combination is linear with
respect to the size of the scanned eye rectangle, O(SN). The
method was implemented in C around OpenCV functionality,
on an Intel i7 at 2.7 GHz, on a single thread,
and it takes 6 msec for both eyes on a face square of
300 x 300 pixels, which is a typical face size for the HD
720p (1280 x 720 pixels) format. We note that an
additional 7 msec are required for face detection.

The code can run in real-time while including face
detection and further face expression analysis. Comparison
with state of the art methods may be followed in
table [12] when comparing with other eye localization
methods and on the right hand side of table [7] when
discussing face fiducial points localization solutions;
for some works the authors have not reported speed
performance, but taking into account algorithm
complexity, it is reasonable to presume that it is too large
for real-time.


Figure 16. Images from the Labelled Faces in the Wild (LFW)
database. The top row shows images with localization error
below 0.05, the middle row images with error between 0.05
and 0.1, and the bottom row images with error above 0.1.


Figure 17. Results achieved on the LFW database: (a) average accuracy and (b) minimum accuracy as imposed by the
stringent criterion, eq. (9). The dashed blue line is the average error for human evaluation, the black line is the proposed
method, the red line is (Valenti & Gevers, 2012), the green line is (Timm & Barth, 2011) and the magenta line is (Ding & Martinez, 2010).


To overcome the difficulty of comparing results when
different platforms were used for implementation, we
rely, as a unifying factor, on the single thread
benchmarking score provided by (PassMark Software Pty
Ltd, retrieved January 2015) for the specific CPU; this
score will be denoted by CPU_s. It must be noted
that such numbers should be considered with precaution,
since there do exist several CPUs that correspond
to the description provided by authors (and we always
took the best case) and the benchmark test may not
be very relevant for the specific processing required by
a solution.


To aggregate the overall time performance of a method
we used the following formula:

$$ T_P = \frac{fps \times \min\{M, N\}}{CPU_s} \qquad (10) $$

where M x N is the frame size used for reported results.
Note that the formula uses only one of the two
dimensions that describe an image to cope with different
aspect ratios.
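
As a quick, hedged check of how the measure in equation (10) behaves, the sketch below evaluates it for one row of table 12 (33 fps on 384 x 288 frames, with a benchmark score of about 200). The function and parameter names are assumptions made for this example.

```python
def time_performance(fps, frame_size, cpu_score):
    """Normalized time performance: frame rate times the smaller frame
    dimension, divided by the single thread CPU benchmark score."""
    m, n = frame_size
    return fps * min(m, n) / cpu_score


# Values taken from one row of table 12: 33 fps, 384 x 288, CPU score ~ 200.
tp = time_performance(fps=33, frame_size=(384, 288), cpu_score=200)
print(round(tp, 2))
```

Higher values are better: a method gains by running faster or on larger frames, and is penalized for needing a more powerful CPU to do so.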


The results for eye localization aggregated with the
measure in equation (10) are shown in table [12],
when comparing with other eye localization methods.
Our method ranks second, following the one proposed by
(Jesorsky et al., 2000), but it gives consistently better
results in terms of accuracy.

It has to be noticed that while, initially, only
(Valenti & Gevers, 2012) reported comparable computational
time, after integrating the larger frame size with
processing power, our method turns out to be 1.5 times
faster. Furthermore, to be able to directly compare our
computation time, we have modified the size of the input
face to be 120 x 120, which corresponds to a 320 x 240
pixels image, leaving everything else the same, and we
find that our method requires 1.6 msec to localize both
eyes; given the additional 7 msec for face detection, we
get a total time of 8.6 msec, that is equivalent to a
frame rate of 116 frames/sec, proving that we clearly
outperform


Table 12. Comparison with state of the art (listed in chronological appearance) in terms of time requirements. Regarding
time performance, we compare ourselves only with papers that provide some measure of duration. Also we note that, in
general, authors reported non-optimized results on PC platforms and various image sizes. The reported time for (Asadifard
& Shanbezadeh, 2010) is taken from (Ciesla & Koziol, 2012). With gray background we marked the best state of the art
result. T_P is given by equation (10) and higher values are desired.


Time performance


Proposed

(Turkan et al., 2004)

(Kroon et al., 2008)

(Asteriadis et al., 2009)

(Asadifard & Shanbezadeh, 2010)

(Valenti & Gevers, 2012)+MS

(Valenti & Gevers, 2012)+SIFT

(Florea et al., 2012)

(Jesorsky et al., 2000)

(Zhou & Geng, 2004)

(Cristinacce et al., 2004)

(Campadelli et al., 2006)

(Hamouz et al., 2005)


FrameRate

Image size

Platform

CPU_s

T_P

76 fps 

1280x 720 

i7 2.7 GHz 

1747 

31.23 

33 fps 

384 x 288 

P3 850MHz 

« 200 

47.52 

15fps 

320 x 240 

Core 2 

981 

3.67 

0.7 fps 

384 x 286 

P3 500MHz 

140 

1.42 

0.08 fps 

384 x 286 

P4 3.2GHz 

720 

0.032 

0.07 fps 

720 x 576 

P4 2.8GHz 

618 

0.065 

12fps 

384 x 286 

n/a 

n/a 

n/a 

2fps 

n/a 

n/a 

n/a 

n/a 

3.84 fps 

384 x 286 

PM 1.6 GHz 

514 

2.13 

15 fps 

320 x 240 

Core 2 

981 

3.67 

90 fps 

320 x 240 

Core2 2.4GHz 

981 

22.01 

29 fps 

320 x 240 

Core2 2.4GHz 

981 

7.09 


the method from (Valenti & Gevers, 2012). 
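
The frame rates quoted in this subsection follow directly from the per-frame durations. The short sketch below reproduces that arithmetic; the function name is hypothetical, and the assumption is that eye localization and face detection run one after the other on a single thread.

```python
def frame_rate(eye_localization_ms, face_detection_ms):
    """Frames per second when both stages run sequentially on every frame."""
    return 1000.0 / (eye_localization_ms + face_detection_ms)


# 320 x 240 input: 1.6 msec for both eyes plus 7 msec for face detection.
small = frame_rate(1.6, 7.0)
# HD 720p input: 6 msec for both eyes plus 7 msec for face detection.
hd = frame_rate(6.0, 7.0)

print(int(small), "frames/sec at 320 x 240")
print(int(hd), "frames/sec at 1280 x 720")
```

With these durations the face detector, not the eye localizer, accounts for most of the per-frame time at low resolution.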


5.9. Discussion 


The previous subsections within this "Results and
Discussions" part have guided the reader through various
experiments and measurements that present a thorough
comparison of the eye localization performance of the
proposed ZEP eye localization method, which is shown to
perform remarkably well across a wide variety of
conditions and datasets. Some of the presented numbers
and experiments deserve supplemental emphasis
and clarification.


A first issue of discussion is related to the experimental
setup, namely the databases that are currently used
in the assessment of algorithm accuracy. BioID has
gained widespread recognition through the years, as
it was one of the earliest face image databases that
contain facial landmark ground truth annotations. As
such, BioID was intensively used for accuracy comparisons,
with a clear tendency over time to concentrate
the efforts on getting top results on BioID alone. As
one may have noticed in table [6], the here proposed
method is outperformed on BioID by the algorithms
proposed by (Timm & Barth, 2011) and (Valenti & Gevers, 2012).


We can notice that most of the methods are overtrained
on standard conditions, and thus perform very well
within their over-learned domain. As such, we claim
that these approaches are not relevant in a broader,
real-life testing scenario. The approaches proposed by
(Valenti & Gevers, 2012) and (Timm & Barth, 2011) are
thus retained as significant eye localization methods
and we further tested them; we also
included the solution from (Ding & Martinez, 2010)
as being a high profile method built outside the BioID
database; the results showed that we outperform these
methods by a large margin.


As anyone knowledgeable in the field observes, the
BioID database contains mostly frontal pose, frontal
illumination and neutral expression faces, and catches
only a small glimpse of the problems related to eye
localization. As such, intensive performance comparison
must be realized outside these standard conditions, as
(Valenti & Gevers, 2012) does in the case of varying
illumination and pose and as we do in the case of noise,
variable illumination, expression and pose variations.
Several tests that are reviewed again here prove the
superior performance of the proposed ZEP eye localization
method in these extreme conditions.


The non-frontal illumination and the subject pose
variations are key issues in real-life, unconstrained
applications. Typically these are tested within the Extended
YaleB database, where ZEP performs marginally better
(+2%) than the method in (Valenti & Gevers, 2012)
and significantly better than (Timm & Barth, 2011)
and (Ding & Martinez, 2010), as shown in table [11].


Subject emotional expressions hugely affect eye shape
and surroundings. The Cohn-Kanade database is the
state of the art testbed in emotion-related tasks; in
this case, the ZEP eye localization outperforms (Timm
& Barth, 2011) by 10%, (Valenti & Gevers, 2012) by
some 30%, and (Ding & Martinez, 2010) by 60%, as
shown in table [8].

Within all databases, closed eyes present an independent
challenge. As noticed by the authors in (Valenti
& Gevers, 2012), their method is prone to errors in
detecting the closed eye center (which is confirmed by
the experiments across all databases). The proposed
ZEP method is much more robust to closed eyes, due
to the way in which the eye profile is described within
the proposed encoding of the luminance profiles.

Finally, we consider that the most relevant test is
performed on the LFW database, taking into account the
size (more than 12000 images), the image resolution
(extremely low) and especially the fact that the images
were acquired “in the wild”. Yet, on the LFW database,
which is currently one of the most challenging tasks,
we outperform the method in (Timm & Barth, 2011)
by at least 5%, the one in (Valenti & Gevers, 2012) by
a wide margin (+13%) and the method from
(Ding & Martinez, 2010) by nearly 30%. In addition,
we are much closer to human accuracy (as shown in
Figure 17 (a)).

Regarding the computational complexity, the proposed
method requires a computational time which is
lower than the time required by the method from
(Jesorsky et al., 2000); yet the accuracy of the proposed
method is significantly higher. If we compare
only the computation time, without considering the
image size, one may consider the method from (Valenti
& Gevers, 2012) to be faster. Yet, tests showed that
the proposed solution is still faster than the
implementation from (Valenti & Gevers, 2012) at equal
image resolution (namely 320 × 240). We thus claim
that the proposed method is the fastest solution
within the select group of high accuracy methods.


6. Conclusions 

In this paper, we proposed a new method to estimate
the eye centers location using a combination of Zero-
based Encoded image Projections and an MLP. The
eye location is determined by discriminating between
eyes and non-eyes through the analysis of the normalized
image projections encoded with the zero-crossing based
method. The extensive evaluation of the proposed
approach showed that it achieves high accuracy in real
time. While the ZEP feature was used for eye
description, we consider that it is general enough and
may be used in numerous other problems.
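
To make the idea of a zero-crossing based encoding of
normalized image projections more concrete, the following
is a minimal illustrative sketch in Python. It is not the
implementation evaluated in this paper: the normalization
(range scaling followed by mean removal), the per-mode
parameters (width and signed extremum) and the function
names projections, normalize and encode_modes are all
assumptions chosen for illustration, and the MLP stage is
omitted.

```python
from typing import List, Sequence, Tuple


def projections(patch: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Return the row-sum and column-sum profiles of a 2-D grayscale patch."""
    horizontal = [sum(row) for row in patch]        # one value per image row
    vertical = [sum(col) for col in zip(*patch)]    # one value per image column
    return horizontal, vertical


def normalize(profile: Sequence[float]) -> List[float]:
    """Assumed normalization: scale to [0, 1], then subtract the mean
    so that the profile oscillates around zero."""
    lo, hi = min(profile), max(profile)
    if hi == lo:                                    # flat profile: no structure
        return [0.0] * len(profile)
    scaled = [(v - lo) / (hi - lo) for v in profile]
    mean = sum(scaled) / len(scaled)
    return [v - mean for v in scaled]


def encode_modes(profile: Sequence[float]) -> List[Tuple[int, float]]:
    """Split the profile at its sign changes (zero-crossings) and describe
    each resulting mode by (width, signed extremum). These two parameters
    are a hypothetical choice, not the paper's exact descriptor."""
    modes: List[Tuple[int, float]] = []
    start = 0
    for i in range(1, len(profile) + 1):
        sign_changed = i < len(profile) and (profile[i] >= 0) != (profile[start] >= 0)
        if i == len(profile) or sign_changed:
            segment = profile[start:i]
            extremum = max(segment, key=abs)        # largest magnitude, sign kept
            modes.append((i - start, extremum))
            start = i
    return modes


if __name__ == "__main__":
    # Toy patch: a dark blob (low values) on a bright background.
    patch = [
        [9, 9, 9, 9],
        [9, 2, 2, 9],
        [9, 2, 2, 9],
        [9, 9, 9, 9],
    ]
    rows, cols = projections(patch)
    print(encode_modes(normalize(rows)))
    print(encode_modes(normalize(cols)))
```

In this sketch each projection is reduced to a short list
of modes, so the resulting descriptor is much smaller than
the raw profile; such a compact vector is the kind of input
that would then be passed to a classifier.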